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Foreword 



The fourth edition of the European Conference on Geostatistics for 
Environmental Applications (geoENV IV) took place in Barcelona, 
November 27-29, 2002. As a proof that there is an increasing interest in 
environmental issues in the geostatistical community, the conference 
attracted over 100 participants, mostly Europeans (up to 10 European 
countries were represented), but also from other countries in the world. Only 
46 contributions, selected out of around 100 submitted papers, were invited 
to be presented orally during the conference. Additionally 30 authors were 
invited to present their work in poster format during a special session. 

All oral and poster contributors were invited to submit their work to be 
considered for publication in this Kluwer series. All papers underwent a 
reviewing process, which consisted on two reviewers for oral presentations 
and one reviewer for posters. The book opens with one keynote paper by 
Philippe Naveau. It is followed by 40 papers that correspond to those 
presented orally during the conference and accepted by the reviewers. These 
papers are classified according to their main topic. The list of topics show 
the diversity of the contributions and the fields of application. At the end of 
the book, summaries of up to 19 poster presentations are added. 

The geoENV conferences stress two issues, namely geostatistics and 
environmental applications. Thus, papers can be classified into two groups. 
The reader will find a number of papers dedicated to the most recent 
methodological developments, with examples predominantly in 
environmental sciences. The remaining ones provide a good indication of 
the wide variety of environmental applications in which geostatistics plays 
its role. 

The fourth volume in the geoENV conference series proves how 
dynamic the geostatistical community is, and confirms the relevance of 
geostatistics as a tool to be included as a standard procedure in 
environmental sciences. We now look forward to geoENV 2004 for new 
applications and new methodological advances. 
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Abstract: This research focuses on two original statistical methods for analyzing large 

data sets in the context of climate studies. First, we propose a new way to 
introduce skewness to state-space models without losing the computational 
advantages of the Kalman filter operations. The motivation stems from the 
popularity of state-space models and statistical data assimilation techniques in 
geophysics, specially for forecasting purposes in real time. The added 
skewness comes from the extension of the multivariate normal distribution to 
the general multivariate skew-normal distribution. A new specific state-space 
model for which the Kalman Filtering operations are carefully described is 
derived. The second part of this work is dedicated to the extension of 
clustering methods into the distributions of distributions } framework. This 
concept allows us to cluster distributions, instead of simple observations. To 
illustrate the applicability of such a method, we analyze the distributions of 
16200 temperature and humidity vertical profiles. Different levels of 
dependencies between these distributions are modeled by copulas. The 
distributions of distributions are decomposed as mixtures and the algorithm to 
estimate the parameters of such mixtures is presented. Besides providing 
realistic climatic classes, this clustering method allows atmospheric scientists 
to explore large climate data sets into a more meaningful and global 
framework. 
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P. Naveau, M. Vrac, M. G. Genton, A. Chedin and E. Diday 



1. INTRODUCTION 

In geophysical studies, the dimension of data sets from most oceanic, 
atmospheric numerical models and satellites is extremely large. There exists 
a variety of recent techniques to deal with such an issue in the special 
context of climate studies. For example, Bayesian methods (e.g. Wikle et al., 
2002), data mining, imaging and statistical visualization procedures have 
provided interesting and innovative ways to analyze large climatic data sets. 
In addition to the computational problem, the distribution of climatic random 
vectors is often supposed to be Gaussian or a mixture of Gaussian 
distributions, although this assumption is not always satisfied for a wide 
range of atmospheric variables. For example, the distribution of daily 
precipitation amounts is by nature skewed. In this paper, we attend to 
address these two problems, large size and skewness, with two different 
approaches. Because the scope of these problems is very large, we will focus 
our attention on two specific statistical methods used in climate studies. In 
Section 2, we will present a simple way to incorporate skewness in Kalman 
filtering techniques without losing the computational advantages associated 
with the normal distribution (Naveau and Genton, 2002). In Section 3, the 
concept of distributions of distributions (Diday et al., 1985; Vrac 2002; Vrac 
et al., 2001) will be used in order to improve classical clustering methods for 
large climatic data sets. This application is closely linked to the algorithm of 
inversion of the equation of radiative transfer (Chedin et al., 1985). 



2. GENERAL SKEWED KALMAN FILTERS 

Before presenting the details of our research on Kalman filters, we want 
to clarify some climatic terms to the statistician who may not be familiar 
with atmospheric sciences. In particular, we would like to recall the meaning 
of numerical models and data assimilation in the context of this work. For 
the former, a numerical computer model solves the governing physical, 
thermodynamics and micro-physical processes at different scales of interest 
and over a specific region (depending on the scientific problem under study). 
It provides deterministic outputs of different atmospheric variables 
(temperature, humidity, winds, etc) according to certain forcings (inputs). It 
is worthwhile to note that the evaluation of such computer simulations has 
generated an interdisciplinary effort between scientists and statisticians in 
recent years. The interested reader can look at Berk and collaborators' work 
(Berk et al., 2002) on the statistical assessment of such models. Data 
assimilation can be seen as a way of incorporating observations into a 
numerical model as it runs. From a statistical point of view, the objective of 
data assimilation is to use both sources of data, observations and model 
outputs, to provide a better statistical analysis, in particular to give better 
forecasts. In the context of numerical weather prediction, updates and 
forecasts have be performed routinely and in real time. This compounds with 
the large size of data sets and implies that very efficient but slow methods 
have to be disregarded. The data-assimilation or update step is closely 
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related to Kalman filter which is the best known filtering algorithm in the 
context of Gaussian distributions and linear system dynamics. Before 
presenting the details of our method, we would like to underline there exists 
a very large body of literature dedicated to Kalman filters and its extensions. 
For example, ensemble Kalman filter (Bengstton et al, 2002; Anderson 
2001), space-time Kalman filters (Wikle and Cressie, 1999), partially Non- 
Gaussian state-space models (Shepard, 1994) and particle filters (Doucet et 
al., 2001) have been recently used for many different applications. In 
particular, the approximation of non-Gaussian distributions by a mixture of 
Gaussian distributions has already been implemented via Monte-Carlo 
methods. In the same way, a mixture of general skew-normal distributions 
could be defined and extends the range of applications of our method. But 
because of the limited space available, we will restrict our exposition to the 
simplest form of Kalman filter, the linear one, in this paper. Future work will 
present the extension to more complex Kalman filter models. 

The overwhelming assumption of normality in the Kalman filter literature 
can be understood for many reasons. A major one is that the multivariate 
distribution is completely characterized by its first two moments. In addition, 
the stability of multivariate normal distribution under summation and 
conditioning offers tractability and simplicity. Therefore, the Kalman filter 
operations can be performed rapidly and efficiently whenever the normality 
assumption holds. However, this assumption is not satisfied for a large 
number of applications. For example, some distributions used in a state- 
space model can be skewed. In this work, we propose a novel extension of 
the Kalman filter by working with a larger class of distributions than the 
normal distribution. This class is called general multivariate skew-normal 
distributions. Besides introducing skewness to the normal distribution, it has 
the advantages of being closed under marginalization and conditioning. This 
class has been introduced by Dominguez-Molina et al. (2001) and is an 
extension of the multivariate skew-normal distribution first proposed by 
Azzalini and his coworkers (1996, 1999). These distributions are particular 
types of generalized skew-elliptical distributions recently introduced by 
Genton and Loperfido (2002), i.e. they are defined as the product of a 
multivariate elliptical density with a skewing function. 

2.1 The general multivariate skew-normal distribution 

The general multivariate skew-normal distribution is a family of 
distributions including the normal one, but with extra parameters to regulate 
skewness. It allows for a continuous variation from normality to non- 
normality, which is useful in many situations (Azzalini and Capitanio, 1999) 
who emphasized statistical applications for the skew-normal distribution. An 
^-dimensional random vector X is said to have a general multivariate skew- 
normal distribution (Dominguez-Molina et al., (2001)), denoted by 
6MS^ )W (//,X,D, v,A) if it has a density function of the form: 
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<D m (Z)//;v,A + DZD r ) 



0„(x;//,X)<D m (£>x;v,A), 



X G . 



(i) 



where ju e R ", v e R m , X e R ” xw and zl e R are both covariance 
matrices, D e R mx ", fa (x;p,X) and ®„(x;jlx,X) are the ^-dimensional normal 
pdf and cdf with mean p and covariance matrix X- When D = 0, the density 
(1) reduces to the multivariate normal one, whereas it reduces to Azzalini 
and Capitanio's ( 1999 ) density when m = 1 and v = D ju. The matrix 
parameter D is referred to as a “shape parameter”. The moment generating 
function M(t ) for a GMSN distribution is given by: 



M(t) = 



<D m (D(// + ^);v,A + Z)ZZ) r ) 
® m (Diu;v,A + DZD T ) 



exp < ju t + 






t g R”. (2) 



The simulation of random vectors from the GMSN distribution is rather 
simple. Indeed, Domi}nguez-Molina et al. (2001) showed that ifX g R n and 
Y g R m are two random vectors with joint distribution given by: 



( x ) 
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'(/A 


' E 


-xz/ v 
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A V / 


’ V -DE 
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then the conditional distribution of X given Y < Dju is a general multivariate 
skew-normal distribution GMSN nm (pJ],D,v,A). 

The three basic tools when implementing the Kalman filter are the closure 
under linear transformation, under summation and conditioning. In section 
2.3, we will present how the general skew-normal distribution behaves under 
such constraints. 

2.2 The state-space model and the Kalman filter 

The State Space Model has been widely studied (e.g. Cressie and Wilke, 
2002; Shepard, 1994; Shumway and Stoffer, 1991; Harrison and Stevens, 
1976). This model has become a powerful tool for modeling and forecasting 
dynamical systems and it has been used in a wide range of disciplines such 
as biology, economics, engineerings and geophysics (Naveau et al. 2002; 
Guo et al., 1999; Kitagawa and Gersch, 1984). The basic idea of the state- 
space model is that the ^-dimensional vector of observation Y t at time t is 
generated by two equations, the observational and the system equations. The 
first equation describes how the observations vary in function of the 
unobserved state vector X t of length h: Y t = F t X t + z t where s t represent an 
added noise and F t is a d x h matrix of scalars. The essential difference 
between the state-space model and the conventional linear model is that the 
state vector X t is not assumed to be constant but may change in time. The 
temporal dynamical structure is incorporated via the system equation: X t = G t 
X t .\ + 7j t where 7j t represents an added noise and G t is an h x h matrix of 
scalars. There exists a long literature about the estimation of the parameters 
for such models. In particular, the Kalman filter provides an optimal way to 
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estimate the model parameters if the assumption of gaussianity holds. 
Following the definition by Meinhold and Singpurwalla (1983). The term 
“Kalman filter” used in this work refers to a recursive procedure for 
inference about the state vector. To simplify the exposition, we assume that 
the observation errors s t are independent of the state errors ri t and that the 
sampling is equally spaced, t = 1,..., n. The results shown in this paper could 
be easily extended without such constraints. But, the loss of clarity in the 
notations would make this work more difficult to read without bringing any 
new important concepts. 

2.3 Kalman filtering and general skew-normal distributions 

From Equation (2), it is straightforward to see that the sum of two 
independent general multivariate skew-normal distributions is not necessary 
a general multivariate skew-normal distribution. In order to obtain the 
closure under summation needed for the Kalman Filtering, we extend the 
linear state-space model to a wider state-space model for which the stability 
under summation is better preserved. In order to pursue this goal, we need 
the following lemma. Its proof can be found in Dominguez-Molina et al. 
( 2001 ). 

Lemma 1 Suppose Y = GMSN n m (jU,Tj,D, v,A) and A is a r x n matrix. Then, 
we have X= AY ~ GMSN r>m {AjufLYA J \DA^,v,A) where ,4^ is the left inverse 
of A and A < ~=A' 1 when A is an n x n nonsingular matrix. If Y is partitioned 
into two components, Y\ and 7 2? of dimensions h and n-h respectively and 
with a corresponding partition for //, X, D and v. Then the conditional 
distribution of Y 2 given Y x =y x is: 

GAASN n _ hm (// 2 +X 21 X n — // 1 ),X 22 — X 21 X 11 X 12 ,T) 2 ,v— (4) 

The converse is also true, i.e. if (4) is the conditional distribution of Y 2 
given Y x =y x and Y x ~ GMSNh, m (jU\,Tj\bDi,v h A), then the joint distribution of 
Y x and Y 2 is GMSN nm (ju^D,v,/Y). 

The proof is the same as for the multivariate Gaussian distribution. 

2.4 Extension of the linear state-space model 

Our strategy to derive a model with a more flexible skewness is to directly 
incorporate a skewness term, say S h into the observation equation 

y, = F ' x '+e, 

= Pp t +Q t S t +e t , With F t =(P„Q t ) and X t =(uJ,sf (5) 

where the random vector U t of length k and the d x k matrix of scalar P t 
represent the linear part of the observation equation. In comparison, the 
random vector S t of length / and the d x / matrix of scalar Q t correspond to 
the additional skewness. The most difficult task in this construction is to 
propose a simple dynamical structure of the skewness vector S t and the 
“linear” vector U t while keeping the independence between these two 
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vectors (the last condition is not theoretically necessary but it is useful when 
interpreting the parameters). To reach this goal, we suppose that the bi- 
variate random vector {lf h V T t ) T is generated from a linear system: 

\ U - - ( 6 ) 
k = -L,K-,+n[ 

where the Gaussian noise 77 1 ~ N{\i , 7 , X v ) is independent of rf t ~ N{\x + rj, 
Z%) and where K h respectively L, represents a k x k matrix of scalars, 
respectively a / x / matrix of scalars. The multivariate normal distribution of 
the vector {U T h V T t ) T is denoted by 

f u A fn* 0^ 

y ~ , ' • ( 7 ) 

\ V 'J wJl 0Q JJ 

The parameters of such vectors can be sequentially derived from any 
initial vector {U T 0 , V T 0 ) T with a normal distribution. From (3), we define the 
skewness part S t of the state vector X t = ( U T t , S T t ) T as the following 
conditional variable S t = [V t .\ \ V t < L t y/ t _{\. It follows a general multivariate 
skew-normal distribution S t ~ GMSN/j{ if P \ ,Q + /_ 1 ,L h y/ t ff v ) . Consequently 
the state vector has also a general multivariate skew-normal distribution 

with V t =f\ (8) 



n ,=f n; ! 


ro(n 


( 0 "l 


fl 0) 


, D = 


v, = ^ , 


and A, = 


on;J 


n t 


Xt J 


[ok) 



The price for this gain in skewness flexibility is that this state vector does 
not have anymore a linear structure like the one defined by the system 
equation. If P t = 0 or L t = 0 then the classical state-space model is obtained. 

Proposition 1 Suppose that the initial vector {U T o, V T 0 ) T of the linear system 
defined by (6) follows the normal distribution defined by 

fuA __ oVi 



u *\ 

Vo 


* o 

a 


0 T 


XvY 


\0 


«o + J, 



Then both the state vector X t = ( if t , S T i) T and the observation vector Y t 
follow general multivariate skew-normal distributions, X, ~ 
GMSN Km { y/ h Q h D h y/ h A t ) and Y t ~ GMSN d Jjj i ,T h E h v h A t ) for t > 1. The 
parameters of these distributions satisfy 

V,= K ,vU+X V] = ~ L ,vU + K and /u t = F t y/ t + /u e , 
and 

Q]=K t nlX + K’ £K= L PiX+K and T <= F PtX+^s 



E t =D t Fy,D t =D t _ x G^ and v, =((/V, +r )\ 



The proofs of our propositions about the Skewed Kalman filter can be 
found in Naveau and Genton (2002). 
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2.5 Sequential estimation procedure: Kalman Filtering 

To extend the Kalman filter to general skewed normal distributions, we 
follow the work of Meinfold and Singpurwalla (1983) who derived a 
Bayesian formulation to derive the different steps of the Kalman filtering. 
The key notion is that given the data Y t = (7i,..., Y t ), inference about the state 
vector values can be carried out through a direct application of Bayes' 
theorem. In the Kalman literature, the conditional distribution of (X r l | Y/_i) 
is usually assumed to follow a Gaussian distribution at time t- 1. In our case, 
this assumption at time t - 1 is expressed in function of the general 
multivariate skew-normal distribution: 

( Jf,-, |i;-, ) = < 3 JWSAf„ , n,_, , A-, > , A,_, ) , ( 10 ) 



where . represents the location, scale, shape, and skewness parameters of (X r 
1 | Y ? _i). Then, we look forward in time t , but in two stages: prior to 
observing Y h and after observing Y t . To implement these two steps, Lemma 1 
is used to determine the conditional distribution of a general multivariate 
skew-normal distribution. 

Proposition 2 Suppose that the initial vector (U T o, V T f) T follows the normal 
distribution defined by (9), that the posterior distribution ofX t follows (10) at 
time t - 1 and that we have for U t and V t introduced in (5) 



(U, 11 


\ 


t 


V ~ ^ 

Vt- 1 

kVU) 








Y ,- , 
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~ N k+1 
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£V + 

V“|-! ii (-l J J 



where . represents the posterior mean and covariance. We define the 
following quantities: R + t = L t q + tA L t + i, R* t = Q + m K t + 

s, = QXQl + P,KF + V and ft = L, (ft-i + C t Pj^P,C, ) L] 



and e t = Y t - Q t [K t v Vi + v\] ~ p t [E(S t \Y tA )] - ju e , , where E{S t \Y tA ) is the 



conditional expectation of S t given Y t .\ and C t is the conditional covariance 
C t = cov(V t - U St | Y ? _i). The parameters of the posterior distributions are 
computed through the next cycle by the following sequential procedure: 



(X, |Y ,)~GMSN k+lMl 



(¥,A t ,D t ,v„k,) 


|, with \j) t = 


V, 
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and where 
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w 




v^ + V 





K t y?;_ l + jul + R]Q T t ^- x e t > 
t~ i + Mrj ~ L t C t P t 'E t e t J 



and 



A; nf 




' K-m X'eX 


A + d + , 




-i&tf + l,c,pX'QX K - L , c , p !^' p ,c A , 



Although the notations are a little more complex, the Kalman filtering 
steps for the skewed extended state-space model does not present any 
particular computational difficulties. 



3. DISTRIBUTIONS OF DISTRIBUTION WITH 
APPLICATION TO CLIMATOLOGY 



3.1 Motivations and data 

The data set under study comes from the European Center for 
Meteorological Forecasting (ECMWF). The temporal resolution is of 6 hour 
(0 a.m., 6 a.m., 12 a.m., 6 a.m.) and the data covers the period from 
December 1998 to December 1999. For each latitude and each longitude, the 
values of different atmospheric variables (pressure values, temperature, 
specific humidity, winds, etc) are available at 50 different vertical levels. 
These levels are not equally spaced and vary from one location to another. 
This implies that we can not choose a specific altitude (or pressure level) and 
simply apply classical methods at different chosen altitudes. Despite this 
difficulty, the atmospheric scientist would like to summarize the information 
contained in this multi-variate 3D grid into a 2D map, i.e. on the surface of 
the Earth. Being able to recognize different climatic behaviors is of 
particular interest. An accurate partition of these vertical profiles is essential 
to interpret satellite observations into atmospheric variables (inversion of 
equations of radiative transfer, Chedin et al., 1985). From a statistical point 
of view, we rephrase this scientific question as a clustering problem, 
classifying multi-variate vertical profile distributions into clusters with 
similar physical properties inside a cluster and distinct physical 
characteristics between clusters. Consequently, a fundamental difference 
with classical clustering algorithms is that a classification method has been 
directly applied to distributions (vertical profiles) instead of observations. As 
an application, 16200 multi-variate vertical profile distributions have to be 
decomposed as a mixture of K=1 classes. This number was chosen by 
atmospheric scientists and each class should correspond to a specific 
climatic situation. The distributions will either be of temperatures, 
humidities, or both. To illustrate the clustering procedure, we will focus on a 
particular date (the 15th of December 1998 at midnight). Before showing the 
results of this analysis, we need to establish a basic statistical framework. 
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3.2 Defining distributions of distributions 

Suppose that the vector F=(F /v .., F n ) represents the temperature vertical 
profile distributions over the entire globe. To work with such sets of 
distributions, the concept of distributions of distributions developed by 
Diday (2001) is needed. The details of the clustering methodology of 
distribution of distributions can be found in the work by Vrac (2002, 2001). 
Let t be a real. A distribution function of distributions is defined by 

D t (x ) = e fl F such that F(t) < x}), 

where fl F is the set of all possible temperature distributions. From a more 
practical point of view, D t (x) could be estimated by 

D t (x) = -f j I[F i (t)<x], with F i {t) = -f J I[X ij <t], 

n i = i n. j=1 

where I[A\ represents the indicator function, equal to 1 if A true and 0 

otherwise, andF^) denotes the empirical distribution of the z'th profile that 
has n, observations. Although this estimation strategy has the advantage of 
being simple, the clustering algorithm converges slowly due to the step- 
functions. Instead, we use the “Parzen estimation method” to model the 
vertical profile distributions 

y;.(x) = -Ljr^f x -Lf l and F(*)=f f£x)dx 

n > h i j = i l h < J J “ c0 

where K is a kernel function and h t the window width (Silverman, 1986). 
Because the density d t (x)=D' t (x) takes its values on [0,1], we choose to 
model it by a Beta density 

F, (*) = 3 P \ Pr \ xP "' (l ~ X F ’ ^ r t =(Pt’ v t) >0 ( 12 ) 
r ( p<) t M 

Hence, d t , r t (x)= d t , r t (u) with r t estimated from the sample { f z (0} 
with /=1,...,Z7. 

For the practitioner, studying the relationship between two given 
temperatures, say t\ and t 2 , is of primary interest. To investigate such a link, 
the definition of D t with t real is extended to the bi-vector t ={t\di) by setting 

D t (jCj , x 2 ) = P ({F g fl F such that F(t x ) < x x and F(t 2 ) < x 2 }) . 

The extension to higher dimensions does not present any major difficulty, 
but to reduce the notational complexity we restrict our exposition to the bi- 
variate case for the remainder of this paper. 

3.3 Mixture of distribution of distributions and copulas 

Our goal is to cluster the different vertical profile distributions into K=1 
classes. To perform this task, we assume that the distribution D t can be 
expressed as a mixture of distributions 
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K 

D , G,x 2 )=Ej n kD t , k (x i ,x 1 ) 

k = 1 

where Eft = 1, 0 < 7T* < 1 and D tk represents a bi-variate distribution. We 
express the relationship between the distribution D tk and its two marginals 
by directly applying Sklar's theorem (Sklar, 1959; Nelsen, 1998). This gives 

K 

Dfi^X 1 , X 2 ) = n kC t,k {j^t lt k ( x i )’ D t 2i k ( X 2 )) 5 

k = 1 

where C t y is a copula function. There exists a variety of parametric forms to 
model this copula. In our applications, we use Frank's copula (Nelsen 1998) 
i ( (R u -IV R v 

C tk (u 9 v) = - ~ log 1 + — f , with w 5 vg [0,1], 

i°§ p t ,k y - 1 j 

where the positive parameter fi tk ^ 1 is a indicator of dependence, Q*(w,v) ~ 
uv for [3 tk T 1, C kk (u,v) ~ min (z/,v) for fi tk >1 0 and C t jfa 9 v) ~ max(z/+v- 1 ,0) 
for D t k p t k t oo. The first case, respectively the second case, corresponds to 
the independence, respectively to the total dependence. 

3.4 Parameters estimation and clustering algorithm 



The next step is to sequentially cluster the n= 16200 vertical profile 
distributions and to estimate all parameters from the previous sections. The 
chosen method is an extension to distributions of the “Nuees Dynamiques” 
method (Diday et al., 1974). Given a partition IT = {lTi,...,n^} (the first one 
is randomly generated), the clustering algorithm constitutes of 3 main steps: 
(1) estimation of the mixture proportions { 7 ^}, (2) estimation of other 
mixture parameters, (y t \,h Ya,k) for the Beta laws and {Pt,k} for the copula's 
parameter, (3) re-allocation of all individuals co z into K new classes with 
z=l,...,w. This 3 step procedure is repeated until the desired convergence is 
reached. The first step is undertaken by setting n k as the number of elements 
in the Mi class divided by the total number of individuals. Other alternatives 
can be used (Celeux and Govaert, 1993). The second step is realized by 
maximizing the classifier log-likelihood 



wEl 1 -ogKA 0 



; 0 k )], with d -\f t 



where co z = {i: r fa) < x\, f fa) < x} and d k fx\ ,x 2 ; ft) is the density derived 
from D kt (x\,xydk). The last step is implemented by defining the new classes 
as Yl k = {co: nfafa; 6 k ) > max{7i/J/yco;0/): /=1,...,^}}. 

3.5 Application to the temperature profiles 

Figure 1 shows a classification of the 16200 vertical temperature profiles 
into 7 clusters. This result was obtained after applying the clustering 
procedure for two iterations. Although no spatial dependence was introduced 
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in the model, the spatial coherence obtained from the clustering procedure is 
a positive indicator of the quality of the algorithm. From a scientific 
perspective, the clusters provides realistic classes. Cluster 4 can be identified 
as a “tropical class”. Two “polar” clusters can be linked to the winter season 
at the South pole (cluster 1) and to the summer season at the North pole 
(cluster 7). Cluster 3 makes the transition between moderate and tropical 
zones, cluster 6 between polar and moderate zones. The high mountains are 
clearly identified (Himalaya, Andes). 



Decomp ND (Frank -dist beta) 7cl T(225,265) 15/12 




Figure 1. Clustering of the 16200 temperature vertical profiles into 7 clusters. 

3.6 Extension to multi-dimensional distributions 



In the previous sections, we exclusively focused on the temperature 
profiles but extending the procedure to multi-dimensional atmospheric 
vectors, e.g. the bi-variate vector of the temperature and humidity profiles, 
will greatly increase the range of applications of this work. The coupling 
method is based on the following mixture decomposition 



k = 1 






with 



x 



F) 



=(*r, 4 '), 



where the integer r represents either the temperature (r= 1) or the humidity 
(r= 2). Then this couple of distributions can be linked by Sklar's theorem. 
There exists a copula function C such that 



D(x ( 1 ) ,x (2) ) = C(Z) (1) (x (1) ),D (2) (x (2) )). 



Although the notations become more complex, the same overall principles 
of the algorithm described in Section 3.4 can be applied. A main difference 
is that, in addition of setting two temperature levels (/ (1) we also need 
to fix two humidity levels Figure 2 represents the output of such a 

coupling procedure. Cluster 7, respectively cluster 1, corresponds to the 
winter season at the North pole, respectively the summer season at the South 
pole. This two regions were already identified in the temperature clustering, 
but additional variations are generated from humidity in Figure 2. Two 
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tropical classes are identified, very humid (cluster 4) and humid (clusters 3). 
Cluster 4 is in better agreement with existing humid zones than the ones 
obtained before. The other clusters represent transition regions from tropical 
classes (hot and humid) to polar classes (dry and cold). 

7 classes (cop Frank - dist beta) T(225,265), H( 0.00003,0.006) IS/ 12/98 OH 




I ~ 2 1 4 5 • 7 

Figure 2. Clustering in 7 classes by coupling the temperature and the humidity. 



4 . CONCLUSIONS 

In the first part of this work, we showed that extending the normal 
distribution to the general multivariate skew-normal distribution for state- 
space models did neither reduce the flexibility nor the traceability of the 
operations associated with Kalman filtering. To the contrary, the 
introduction of a few skewness parameters provides a simple source of 
asymmetry needed for many applications. Further research is currently 
conducted to illustrate the capabilities of such extended state-space models 
for real case studies. 

By introducing a higher abstraction level in clustering methods, the 
concept of distributions of distributions and copulas extends the applicability 
of current procedures (Diday et al., 2001; Vrac, 2002). In addition, it allows 
to model different dependence levels for probabilistic data, internal 
dependencies inside a distribution of distributions (Section 3.5) and external 
ones, for example between the humidity and temperature vertical profile 
distributions. Besides providing realistic climatic classifications, these 
results emphasize the strong potential of this clustering method at helping 
the understanding of other atmospheric variables and their inter- 
relationships. Other algorithms have been generalized in the same way with 
copulas : the theoretically extensions of the algorithms EM, SEM, SAEM, 
and CEM was derived by Vrac (2002). Comparisons between these extended 
methods and "classical” algorithms of classification indicate that the 
procedures based on the concept of distributions of distributions perform 
better in the context of climatic studies (Vrac, 2002). It is worthwhile to note 
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that the proposed method can also be applied to classical numerical 
observations and functional data. Finally, multi-variate versions of the 
algorithm exist and are based on multidimensional generalized Archimedian 
copulas (Vrac 2002). This extension to multi-variate cases constitutes a 
strong axis of current research. 
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Abstract: A geostatistical optimization algorithm is proposed for super-resolution land 

cover classification from remotely sensed imagery. The algorithm requires as 
input, a soft classification of land cover obtained from a remotely sensed 
image. A super-resolution (sub-pixel scale) grid is defined. The soft land cover 
proportions (pixel scale) are then transformed into a hard classification (sub- 
pixel scale) by allocating hard classes randomly to the sub-pixels. The number 
allocated per pixel is determined in proportion to the original land cover 
proportion per pixel. The algorithm optimizes the match between a target and 
current realization of the two-point histogram by swapping sub-pixel classes 
within pixels such that the original class proportions defined per pixel are 
maintained. The algorithm is demonstrated for two simple simulated images. 
The advantages of the approach are its ability to recreate any target spatial 
distribution and to work with features that are both large and small relative to 
the pixel size, in combination. 



1. INTRODUCTION 

Land cover is an important variable for many scientific investigations and 
operational applications. For example, land cover data are required to 
provide boundary conditions for atmospheric (e.g., global climate 
circulation) modelling, hydrological modelling, geomorphological modelling 
and so on. However, accurate land cover data at the required (coarse) spatial 
resolution are often not available because of the difficulties and expense of 
surveying large areas. Remote sensing has been invaluable for mapping, and 
ultimately monitoring, land cover over large areas because of the complete 
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synoptic coverage provided. However, current state-of-the-art techniques do 
not make full use of the available data within remotely sensed images. In 
particular, techniques are limited by the spatial resolution of the original 
multiple waveband imagery. The objective of this paper was to demonstrate 
a geostatistical technique for land cover classification from remotely sensed 
imagery that actually maps at a spatial resolution that is finer that that of the 
original imagery, thus, making greater use of the available data. 

Hard classification techniques, such as maximum likelihood (ML) 
classification, have been popular in remote sensing for many years (e.g., 
Thomas, 1987). In hard classification, every pixel is allocated to one class 
(for ML classification, the class to which it is most likely to belong). A 
criticism of hard classification is that many pixels actually contain a mixture 
of land cover classes. Such pixels are referred to as ‘mixed’. Mixed pixels 
can arise for two main reasons: (i) more than one distinct (crisp) class is 
represented within a pixel and (ii) classes intergrade within a pixel (e.g., an 
ecotone). In (i) the mixing leads to ambiguity and in (ii) the mixing leads to 
vagueness demanding the definition of fuzzy sets (e.g., Bezdek et al., 1984). 
The concern in this paper is the unmixing of pixels that contain crisp classes. 

The mixed pixel problem led to the adoption of techniques for soft 
classification, originally for geological remote sensing (Adams, et al., 1985). 
Soft classifiers (also sometimes referred to as fuzzy classifiers) map each 
pixel onto many classes and assign membership values to each class which 
predict the proportion of the pixel that each class represents. Examples 
include the linear mixture model (Adams, et al., 1985), fuzzy c-means 
(Bezdek et al., 1984), feed- forward, back-propagation neural networks 
(Atkinson et al., 1997) and support vector machines (Brown et al., 1999). 
Soft classification represents greater information in the land cover prediction 
at no extra cost: hard classification simply omits the land cover proportion 
information, presenting only the most likely class. Indeed, the only drawback 
of soft classification appears to be the difficulty in displaying more than 
three or four class proportions simultaneously in the same map. 

While soft classification is preferable to hard classification, almost 
ubiquitously, because class proportions are predicted per pixel, no attempt is 
made to predict where, within each pixel, the land cover actually exists. 
Thus, if a soft classifier has predicted that a pixel contains 50% woodland, 
30% grassland and 20% built-land, the user (e.g., decision-maker) does not 
know where the woodland, grassland and heathland patches are located 
within the pixel. The new technique demonstrated in this paper is designed 
to post-process a soft classified remotely sensed image to classify (in a hard 
sense) land cover at the sub-pixel scale. This objective is referred to as 
super-resolution classification. 
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Several researchers have attempted super-resolution mapping based on 
remotely sensed imagery of radiance or reflectance (e.g., Flack et ah, 1994; 
Foody, 1998, Steinwendner et ah, 1998; Schneider, 1999). Atkinson (1997) 
suggested super-resolution mapping based solely on the output from a soft 
classification. The idea proposed was to convert soft land cover proportions 
to hard (per-sub-pixel) land cover classes (that is, at a finer spatial 
resolution). The most intuitive (most visually appealing) solution was 
attained by maximizing the spatial statistical correlation between 
neighbouring sub-pixels. The basic idea was to maximize the spatial 
correlation between neighbouring sub-pixels under the constraint that the 
original pixel proportions were maintained (Atkinson, 1997). This basic idea 
was extended by Verhoeye et al. (2000). A fundamental limitation of the 
approach was that the relation between pixels and sub-pixels was modelled, 
thereby mixing scales of measurement (Atkinson and Tate, 2000). 

A solution to the super-resolution problem may be achieved by 
comparing sub-pixels to sub-pixels meaning that a non-linear model should 
be used to achieve solution. A pixel swapping algorithm in which the goal is 
to maximize the spatial correlation between neighbours was demonstrated 
recently for simple simulated images (Atkinson, 2001). This algorithm 
works for both the binary (e.g., target detection) and categorical (e.g., land 
cover) cases. 

Recently, Tatem et al. (2001a) developed a Hopfield neural network 
(FINN) technique (Hopfield and Tank, 1985) for super-resolution target 
mapping. The HNN was used essentially as an optimization tool. To solve 
the super-resolution mapping problem, with the pixel-level class proportions 
as initial conditions, the HNN architecture was arranged as a super- 
resolution grid of sub-pixels. The HNN was then set up to minimize an 
energy function that comprises a goal and constraints: 

E = k x G + k 2 C + b (1) 

where G is the goal (to increase the spatial correlation between 
neighbouring sub-pixels), C is the constraint (that original class proportions 
per-pixel are maintained), b is a bias term and and are weights. The HNN 
was applied initially to detect targets (two-class problem) (Tatem et al., 
2001a), but eventually extended to super-resolution land cover mapping 
(multiple class problem) (Tatem et al., 2001b). 

Tatem et al. (2002) developed an extension of the HNN super-resolution 
mapping technique in which the spatial clustering goal was replaced by a K- 
class variogram-matching goal. This new goal allowed replication of spatial 
pattern, which was particularly useful for objects that were smaller than a 
pixel. In this paper, a geostatistical optimization algorithm is described 
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which is capable of producing super-resolution maps from soft classified 
input images. It represents an alternative to the HNN variogram-matching 
algorithm. 



2. THEORY 



The spatial optimization algorithm used here is based on the two-point 
histogram as defined and used in the program ANNEAL. for, which is part of 
the GSLIB library of Fortran routines (Deutsch and Journel, 1998). The 
present optimization algorithm was coded in S-PLUS. The two-point 
histogram and its use in optimization are presented, followed by a 
description of its use in super-resolution classification. 



2.1 Two-Point Histogram 



This section, which describes the two-point histogram, is adapted from 
Deutsch and Journel (1998). Given a random variable Z that can take one of 
k= 1, ..., K outcomes (i.e., a categorical variable) the two-point histogram for 
a particular lag (distance and direction of separation) h is the set of all 
bivariate transition probabilities: 



fZ(u) e category £, j 
[Z(u + h) e category k'\ 



( 2 ) 



independent of u, for all k , k = 1, ..., K. The objective function 
corresponding to the two-point histogram control statistic is as follows: 



° = if zztrr g (h)-^“ 

h \k=\ k'=\ 




( 3 ) 



where p t ™^ in8 ( h) are the target transition probabilities, for example, 
calculated from a training image and p^ izatbm ( h) are the corresponding 

transition probabilities of the realization image (i.e., the current image being 
altered). 



2.2 Optimization Algorithm 

While equation 2 lies at the heart of the optimization algorithm, it is 
insufficient alone for super-resolution mapping. First, a scheme must be 
devised for altering the sub-pixel values. This can either be via a change to 
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the attribute (as for FINN, ANNEAL.for) or via a swap in sub-pixel location 
(as here). In either case, it is important that the optimization goal (Equation 
2) is constrained so that the original pixel proportions are maintained as 
closely as possible. Where the attribute values are changed this constraint 
should be added to the goal to form a single energy function. In this way, the 
original pixel proportions will be maintained approximately in the solution. 
Where sub-pixels are swapped, the constraint can either be added to the goal, 
or the sub-pixels to be swapped can be constrained to the same pixel. The 
latter strategy, which results in the pixel proportions being maintained 
perfectly in the solution, is adapted here. 

2.3 Initialization 

The decision to swap sub-pixels within pixels means that the attribute 
values in the initial image must correspond to those desired in the solution. 
In the present case, this means that the pixels of the initial image must 
contain hard classified sub-pixels, with the number of sub-pixels per class 
determined in proportion to the class proportions. Thus, in a pixel of 10 by 
10 sub-pixels (for which proportions are predicted as woodland (50%), 
grassland (30%) and built-land (20%)) there will be 50 sub-pixels of 
woodland, 30 of grassland and 20 of built-land. The image to be presented to 
the optimization algorithm is initialized by distributing spatially the required 
number of sub-pixels randomly within each pixel. 

2.4 Summary of algorithm 

The full algorithm is summarized below. 

1. Create current image at sub-pixel scale by randomly distributing land 
cover proportions 

2. Calculate two-point histogram for training image 

3. Calculate two-point histogram for current image 

4. For each iteration 

5. For every pixel 

6. For every sub-pixel (visited in random order within the current pixel) 

7. Compare to another sub-pixel (drawn randomly from the same pixel) 

8. If swap results in smaller objective function, retain swap and update 
two-point histogram. 

Two checks were added to the algorithm to increase its efficiency. First, 
it was found that many pixels contained only one land cover class. Such 
pixels were ignored. It should be noted however, that sub-pixels within such 
pixels may be used in comparison with sub-pixels within adjacent pixels 
because the two-point histogram was computed for eight directions (at 45° to 
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each other) and at various lags. Second, sub-pixels were compared only if 
their classes were different. While not fast, the current implementation in S- 
PLUS was sufficient to demonstrate the utility of the optimization algorithm 
on simulated images. It is anticipated that the algorithm will be written in C 
or C++ in the future for operational use. 



3. SIMULATED DATA 

3.1 Circles 

To provide a simple test of the optimization algorithm two circles of 
different class were simulated on a background in an image of 35 by 35 sub- 
pixels (Figure la). The spatial resolution of the image was then coarsened by 
a factor of 7, to provide an image of 5 by 5 pixels. The proportions of each 
of the three classes in each pixel of the image are shown in Figure lb-d. 
These three images are taken to represent the output of a soft classifier. That 
is, Figure lb-d represents the final result of applying a soft classifier to a 
remotely sensed image (i.e., the current state-of-the-art solution). It also 
represents the sole data input to the geostatistical optimization algorithm. 

3.2 Simulated remotely sensed image 

A simple Boolean simulation was used to provide a more realistic image 
with which to test the optimization algorithm. First, n w =l rectangles of 
varying height and width r ~ U (min r , max r ) were drawn at locations / - U 
(min/, max/) where min r = 2, max r = 14, min / = 0 and max/ = n ) = 46.7 

sub-pixels, where n p is the number of pixels along the image edge and n sp is 
the number of sub-pixels along a pixel edge. These n w rectangles, which 
superimposed themselves naturally, were meant to simulate an area of 
woodland (Figure 2a). Second, n b = 20 rectangles of varying height and 
width r (min r = 2 and max r = 3) were drawn at locations / with min / = 
xyy n ) and max/ = n p .n sp - These rectangles, some of which were 

superimposed, were meant to simulate built-land (i.e., buildings). Because 
buildings were simulated after woodland, the buildings appear to nestle 
inside the woodland, as desired. 

The third class, the background, is meant to represent grassland. The 
effect of overlapping the positions of draws of woodland and build-land 
objects is to create several pixels in the centre of the image that contain all 
three classes. 




Super -resolution classification using the two point histogram 



21 







TO 




to 





Figure 1. (a) Target image (35 by 35 sub-pixels) defined at the sub-pixel scale and (b-d) 
class proportions (5 by 5 pixels) defined at the pixel scale for classes (b) background, (c) left 
circle and (d) right circle. Note that images b-d provide the only input to the optimization 

algorithm. 

m to 




Figure 2. (a) Target image (70 by 70 sub-pixels) defined at the sub-pixel scale and (b-d) 
land cover class proportions (10 by 10 pixels) defined at the pixel scale for simulated classes 
(b) grassland, (c) woodland and (d) built-land. Note that images b-d provide the only data 
input to the optimization algorithm. 

The spatial resolution of the image (Figure 2a) was coarsened by a factor 
of 7 to create an image of 10 by 10 pixels representing proportional land 




22 



P.M. Atkinson 



cover (Figure 2b-d). Figure 2b-d is taken to represent the output from a soft 
classifier applied to a remotely sensed image. Again, it represents the state- 
of-the-art solution, and the only data input to the geostatistical optimization 
algorithm. 



4. RESULTS 

4.1 Initialization 

To provide an input to the optimization algorithm, the pixel proportions 
represented in Figures lb-d and 2b-d were allocated to locations selected 
randomly, as described in section 2.3. The resulting images are shown in 
Figure 3. It is interesting to note that this sub-pixel allocation presents a 
useful method of visualizing a soft classification, particularly where the 
number of classes K > 3. However, that benefit is coincidental to the present 
goal. 

(a) (b) 




Figure 3. Initial images for (a) circles (35 by 35 sub-pixels) and (b) land cover (70 by 70 
sub-pixels). For each pixel, the (soft) class proportion for each class defined at the pixel scale 
(Figure lb-d) is allocated (hard classification) to sub-pixels whose spatial location is defined 
randomly. The number of sub-pixels for each class is determined by the pixel scale class 
proportion. This image is the input to the optimization algorithm. 



4.2 Training 

In the practical or operational situation, training (i.e., definition of the 
target two-point histogram for use in Equation 2) would be provided by a 
training image with the desired super-resolution. A practical example might 
be super-resolution classification of Landsat Thematic Mapper (TM) 
imagery (spatial resolution of 30 m by 30 m) via training with a classified 
IKONOS image (spatial resolution of 4 m by 4 m). This strategy is sensible 
because Landsat TM images cover a much larger area than IKONOS images. 
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Many other sensor combinations can be used to illustrate the utility of this 
approach. 

In the absence of training data, and to test the utility of the algorithm in 
the ideal case, the training two-point histogram was obtained from the target 
image. This choice is justified because (i) the two-point histogram is 
calculated at a limited number of lags only ( h max = n sp ) such that it contains 

only partial information on the original image and (ii) while in the practical 
situation the accuracy of the predicted super-resolution classification will 
depend on the extent to which the training image represents the spatial 
character of the true target, that is not the present interest. 

4.3 Optimization 

The first six iterations of the optimization algorithm are shown in Figure 
4 (circles) and Figure 5 (remotely sensed classification). The super- 
resolution classifications achieved after 100 iterations are shown in Figure 
6a (circles) and Figure 6b (remotely sensed classification). Additional 
iteration may have decreased the energy function (Equation 2) further, but 
the results are sufficient to illustrate the utility of the technique. 
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Figure 4. The first six iterations of the optimization algorithm, hach iteration involves, 
for every pixel, a comparison of every sub-pixel (chosen in a random sequence) with another 
sub-pixel chosen randomly from the same pixel. 
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Figure 4. The first six iterations of the optimization algorithm. Each iteration involves, 
for every pixel, a comparison of every sub-pixel (chosen in a random sequence) with another 
sub-pixel chosen randomly from the same pixel. 

The algorithm appears to have reproduced the circles almost perfectly, 
and the land cover target reasonably closely in spatial character (built-land). 




Figure 5. The first six iterations of the optimization algorithm. Each iteration involves, 
for every pixel, a comparison of every sub-pixel (chosen in a random sequence) with another 
sub-pixel chosen randomly from the same pixel. 
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4.4 Assessment 

The circles represent Woodcock and Strahler’s (1987) H-resolution case 
in which the spatial resolution is fine relative to the size of objects in the 
scene (in this case the circle diameter). In the H-resolution case, the target 
(Figure la) can be reproduced with a spatial clustering algorithm in which 
the objective is to maximise the spatial correlation between neighbours (c.f., 
Atkinson, 2001). The advantage of the present technique is its ability to 
match any prior target spatial distribution. The potential of this is not 
realized fully for the circles, although the example does illustrate the 
generality of the technique. 

(a) (b) 




Figure 6. Super-resolution images of (a) circles (35 by 35 sub-pixels) and (b) land cover 
(70 by 70 sub-pixels) after 100 iterations. 

While the woodland (and grassland) areas in the simulated remotely 
sensed scene represent the H-resolution case, this is to a lesser extent than 
for the circles because of the greater curvature of the feature (object) 
boundaries. The parcels of built-land, however, represent the L-resolution 
case in which the spatial resolution is coarse relative to the size of objects in 
the scene. In the L-resolution case, a spatial clustering algorithm would fail 
to provide a realistic solution, joining together patches of a given class where 
possible. Not only does the optimization algorithm recreate the spatial 
character of the target, but it also allows a realistic solution for both the H- 
resolution (woodland, grassland) and L-resolution (built-land) cases 
simultaneously. 

The buildings in the solution do not match the buildings in the target on a 
sub-pixel-by-sub-pixel basis. Neither are they expected to. Insufficient data 
and constraints are provided to achieve such spatial definition. However, the 
spatial character of the target is recreated reasonably well. Such a map would 
find utility in many circumstances, but particularly as input to spatially 
distributed process models (e.g., as boundary conditions for flood inundation 
models where water flows around buildings etc.). 
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5. DISCUSSION 

While the algorithm presented represents a useful basic tool for super- 
resolution classification, several possible refinements have been identified 
and these are described in this section. 

5.1 Simulated annealing 

Convergence appears to be fast (the main features are identifiable within 
the first eight iterations). However, it should be remembered that n sp by n sp 
comparisons are made per pixel at each iteration (where n sp is, in this case, 
7). The rapid rate of convergence may be a cause for concern in that it is 
possible for the solution to become trapped in local minima. If that is a 
supportable concern then the algorithm can be modified readily to include 
full spatial simulated annealing with an annealing schedule designed to 
avoid local minima (e.g., van Groenigen, 1999). A further possibility is to 
run the algorithm several times with different initializations and compare the 
solutions. However, for the present application, where the solution is 
constrained by the original pixel proportions, the risk of local minima is 
believed to be small. 

5.2 Non-stationarity and regularization 

It is clear from the target image that the local spatial character of 
variation differs from place-to-place. It might be useful then if the target 
two-point histogram also varied from place-to-place. The problem, of 
course, is that in the practical situation the small training image available at 
the target spatial resolution will not relate to any particular spatial location in 
the image being optimized. Some location-specific information is provided, 
however, by the initial image output from the soft classifier (i.e., Figure lb-d 
and 2b-d). In particular, the local two-point histogram may be computed at 
the pixel-scale. The problem is that this information is provided at the pixel 
scale, whereas information is required at the sub-pixel scale. A solution to 
this problem may be possible via regularization of the modelled sub-pixel 
two-point histogram, thereby providing a link between the two scales of 
measurement (Journel and Huijbregts, 1978; Jupp et al ., 1988). This will be 
the subject of future research. 

5.3 Error and the point-spread function 

Two issues which have been deliberately overlooked are error and the 
point-spread function (PSF). In the simulated soft classifications (Figures 
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lb-d and 2b-d) zero error was assumed. That is, the (land cover) class 
proportions were assumed to be predicted perfectly. Research has shown that 
in practice the accuracy of soft classification is typically 80% (e.g., Atkinson 
et al ., 1997). This error will have a detrimental effect on super-resolution 
mapping. In the presence of such error it would be sensible to allow the sub- 
pixel values (i.e., the original class proportions) to change to an extent 
determined by the expectation of the error. Whether or not adequate 
convergence is possible in this essentially under-constrained scenario is 
unclear. 

The PSF provides a second practical problem that has been ignored in the 
analysis above. The PSF is usually shaped like a two-dimensional step 
function (termed a square wave response) convolved with a smoothing filter. 
It means that the remotely sensed response for a given pixel is, in part, a 
function of spatial variation in neighbouring pixels. This introduces 
ambiguity into the class proportions predicted by the soft classifier. It means 
that the pixels in the class proportion image (Figure lb-d and 2b-d) should 
actually overlap (in fact, should have the same shape as the PSF). In 
practice, therefore, it may be desirable to allow some swapping of sub-pixels 
between neighbouring pixels, restricted to zones of PSF overlap, and in 
number determined by the amount of overlap. 



6. CONCLUSION 

A new geostatistical optimization technique has been demonstrated for 
super-resolution land cover prediction from remotely sensed imagery. While 
no quantitative assessment of accuracy was provided, the results are 
encouraging. In particular, the algorithm provides acceptable solutions in 
both the H-resolution and L-resolution cases, and when both are combined. 
The super-resolution map (L-resolution case) is likely to be useful as input to 
spatially distributed process models. The optimization technique will be 
applied in future research to real remotely sensed imagery. Further, the 
algorithm will be extended to incorporate the suggestions made in section 5, 
in particular, the use of a non-stationary model. 
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Abstract: Environmental processes are rarely stationary and isotropic. In order to 

produce maps of pollutant concentration over a region where few 
measurements are available, classical kriging performs badly. To have 
more accurate maps, it is necessary from one hand to take into account 
external information, such as emissions and meteorological data and from 
the other hand to release the stationary assumption, modelling the 
variogram when kriging. In this paper we propose a non parametric 
estimator of the variogram, we study its theoretical properties and its 
behaviour on a simulation case. We use this estimator and a chemistry- 
transport model to produce maps of ozone concentration over Paris area 
and compare to maps obtained with classical kriging methods. 



1. INTRODUCTION 

Beyond forecasting the level pollutant concentration for the next day, 
air monitoring agencies are in charge of estimating the level of pollutant 
concentration over an entire area, including locations where no 
measurement has been made. 

In that aim we have two methods in mind: 

- a statistical interpolation from observations made on the monitoring 
network, 
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- a simulation by means of a chemistry-transport model on a grid. 

As the monitoring network is often too sparse and not well located, 
no interpolation method can render the complexity of the pollution 
phenomenon with only measurement data. Besides, deterministic 
physical models are very complicated and often have biases that can be 
very high. 

It is worth combining the two approaches: the chemistry- transport 
model is used to catch the phenomenon structure, while measurements 
on the monitoring network are used to adjust the outputs on the 
observations. 

In order to perform kriging, the variogram needs to be estimated. 
Usually assumptions of stationarity and isotropy are made, but it is 
widely recognized that real environment processes are rarely stationary 
and isotropic. In the case of pollutant concentration, the behaviour of two 
sites depends more on their typology (rural, urban) than on their 
distance. 

The question is to know if it is better to bypass the assumptions of 
stationarity and isotropy and use parametric fitting of variograms which 
are robust and well-tried or if it is more suitable to use more 
sophisticated variograms, adjusting well the data instationarity but with 
the drawback of unstability. 

Several attempts to model nonstationary covariance or variogram 
functions have been made, see for example Sampson and Guttorp (1992), 
Hall and Patil (1994), or Fuentes (2001). 

Following Guillot, Monestiez and Senoussi (2000), we propose a non 
parametric, kernel based estimator of the variogram for nonstationary 
fields. Firstly we show that it is admissible, that is, it is conditionally 
negative definite and propose some practical improvements. Then this 
estimator is compared to classical parametric fitting of the variogram 
through a simulation study and on a dataset of ozone concentration over 
Paris area. 



2. NON-PARAMETRIC ESTIMATOR OF THE 
VARIOGRAM 

Let us consider a second order, non stationary random field Z(s), 
defined on a domain D of IR 2 , with a covariance function C, and a 
semivariogram function T on D x /). 

Let S = {si,. ,.,s„} be a set of points of D and z(sj,t), \ <i<n,\<t<T 
be a set of T i.i.d. observations of Z at sites i. 

Let us denote by C emp the empirical covariance matrix of Z, and r emp 
the empirical semivariogram of Z, namely, 
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C emp{Uj) = Cy = ^ ^ (z(S t ,t)~Z (.S', ))(z(.S' ; , t)z{S J )) 



t = 1 



T em P (U j ) = r i j=^='Z 00 ; , t ) - Z (Sj ,t)~(z (S j ) - Z ( Sj ))) 2 

^ t = 1 



Let AT be a non-negative kernel defined on D x D. We study the 
nonparametric estimator of C and T obtained by regularization of C em p 
and r emp* 



C h (u,v) = Y J c ij 

Uj 



Z k,l K-h( U — S k’ v — S l) 



T h (u,v) = Y J Y ij 

Uj 



K h (u-s„v-Sj) 

Z k,l K h (u — S k ,V — Sj) 



where K h {uf) = K(ulh,vlh) for any positive real h. We suppose K is a 
separable kernel i.e. K{u,v) = k{u)k{v) for all (z/,v) e IR 2 . This 
assumption is sufficient to prove properties of c h and r h , but it is not 



clear that it is necessary. This estimator is quite the same as the one 
proposed by Hall and Patil (1994), in the stationary case. 



2.1 Positive definiteness of the covariance estimator 



Proposition 1 c h is positive definite. 



Proof. K is positive definite, i.e. for any (zq,...,z/ m ) in D and complex 



numbers (#i,...,# m ), jK h (ui,Uj) > 0. For any integer m , complex 

valued vector (oci,. . .,oc m ) and points zq,. . .,z/ m of D , we have 



^ Ch (Mk ’ u i ) 

k,l 






- K h( ll k ~ S n U t ~ S j) 

a k a 

Lij K h( u k- s i’ u i- s j) 



= Yc.a 

ij ij 

Uj 

The empirical covariance matrix C is positive definite, therefore, by 
Fejer's theorem, it is enough to prove the positive definiteness of the 
matrix 0 = (Of to prove that Z^ Cydy > 0. Let (/?!,...,/?„) be a complex 
valued vector, 
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Uj 



YLwPiPj 



Kh(u — Sj,v — Sj) 

X k >( u *- s JX k M-Sj) 

i j 



\r y a kPi a i Pi 

uj tj'ZK K ~ s ,) Z K ( u i ~ s j ) 

i j 



K h (u k ~ s i > 11 1 ~ s j ) 



> 0 

and © is positive definite. 



2.2 Conditional negative definiteness of the variogram 
estimator 



It is straightforward to verify that the empirical variogram is 
conditionally negative definite, that is for any complex valued vector 
(ai,...,a m ) such that £, a, = 0, a t a < 0. The point is that we can 

write 



2y tJ =af+a;-2c s 

where cT ,= j_ £,(z(.s'„t) - z(.s’,)) 2 is the empirical variance of Z(s,). So 

T 

Z Wy = Z Z«y+Z “/■ <T y Z " 2 Z a /«/ £ V 

Uj i j j i Uj 

= -l^CCjCy 

Uj 

< 0 



Proposition 2 r h is conditionally negative definite. 

Proof For any integer m , complex valued vector (ai,...,a m ) such that 
Z/a/ = 0 and points of D , we have 
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Yj a k a i Th ^ u k ’ u i) 

k,l 



zz°x 



Z t k h( u k- s i)t 



, v V k h (u t Sj ) 

+ ZZ CT ^/ v 1 TTr rrZ 



j i 



Z j k h( u i- s jY 



a. 



— K h (u k S i )U l Sj ) 

-2> c.. > a k cc,r= 

u w T.i.i K 3<‘,-s„ u i-Si) 



- -*ZZ‘.. 

;,/ k,l 

< 0 



- K h (u k -s i ,u,-s J ) 






ZvX>/, -s,,u,-s i ) 



2.3 Practical design 

The covariance and variogram estimators will be used in kriging 
systems. In the case the process Z is not continuous, for example if there 
is a nugget effect or if there is a measurement error, C{s h s ,) is not an 
estimation of Var(Z(.s,)) and r (s, •,£,•) is not necessarily 0. Hence the 
diagonal of the matrix involved in the kriging system has to be replaced 
by the empirical variance when the covariance is used and by 0 when the 
variogram is used. In both cases, it remains to check that they are 
admissible covariance or variogram. 

Case of the covariance Let C * = ( c y) be the matrix such that 

0^ = Cy if i*j 

c, = of 

Since c a X&,/ Ckj (fhi^k ~ $i) / (X k k} i($k ~ Si)) — W dii^^k^k^k ) for 
suitable h we will have c u < c a C h can be written c h = C /z + X, X 
being a diagonal matrix with positive elements, hence c h stills being 
positive definite. 

Case of the variogram Let T h ={y y) be the matrix such that 



Yij = Yu if 

Yn = 0 



For any complex valued vector (oci,. . .,a n ) such that X/ = 0 we have 
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Z apj Y y (S, ,Sj ) = X a i a jY 9 ( s i " S V>- Z a / a ,Z.y 

Uj ij i=j 

= X r ij ( s , k- | 2 r a 

Uj i 

< 0 

since nis conditionnally negative definite and y zz is a weighted sum of 
positive terms. 

It has to be noticed that if the set S is sparse and if s 0 is close to an 
isolated point s i0 , it may happen that r A (^o^) and C h {so,s^) be very close 
to r h (s i0 ,Sj) and C/ 7 (^ /0 ,y-)- When ordinary kriging is performed, this leads 
to kriging weights equal to 0 if / ^ z 0 and 1 if i = z 0 . In such a case we 
have z (so) = Z(^ z0 ) and a vanishing kriging variance. 

While it is well established that the choice of the kernel K is not a 
crucial point, the choice of the bandwith h is an important issue. Large h 
lead to oversmooth the covariance or the variogram and in the kriging 
setting measurements at monitoring sites will be considered as almost 
independant. Small h lead to the empirical covariance or variogram at 
monitoring sites, but estimation at non-monitoring sites will lack 
robustness and kriging results can be quite inappropriate. In the 
framework of kriging the choice of parameter h will generally be driven 
by the minimization of a cross validation criteria. 



3. SIMULATION EXAMPLE 

In order to check whether non parametric estimation leads to an 
improvement in kriging nonstationary fields, we simulate a Gaussian 
random function Z on a domain D = [0; 10] x [0; 10], deforming the 
space: 



cov(Z(s),Z(0) = exp(-|^(s)-^(0||) 

with = <p{x,y) = ((1 / VTT 1 1 s-0\ I ) (x-5.05), (1 / V 2 T 1 1 s-Ol I ) (y - 
5.05)). O is the point (5.05,5.05). Points of the domain which are located 
near the center are slightly correlated with their neighbours, points which 
are far from the center are highly correlated with their neighbours and 
this process is strongly nonstationary. We consider 100 realizations of Z 
at 5 1 x 5 1 sites on a regular grid of D. The simulations are carried using 
a turning bands algorithm, written by Lantuejoul, 2001. 100 points are 




